Predicting Missing Attribute Values Using k-Means Clustering
Abstract
Problem statement: Predicting missing attribute values is an important data preprocessing problem in data mining and knowledge discovery. Several methods have been proposed for treating missing data; the most frequently used is to delete every instance that contains at least one missing feature value. When a dataset has only a few missing attribute values, those instances can be discarded, but when the number is high, deleting them may discard essential information. Other methods, such as assigning the average value or the most common value to the missing attribute, make good use of all the available data. However, the assigned value may not reflect the information from which the data were originally derived, so noise is introduced into the data. Approach: In this study, k-means clustering is proposed for predicting missing attribute values. The performance of the proposed approach is compared with nine different methods. The overall analysis shows that k-means clustering predicts missing attribute values better than the other methods. After the missing attributes are assigned, feature selection is performed with Bees Colony Optimization (BCO), and the improved Genetic KNN is applied to measure classification performance, as discussed in our previous study. Results: Performance is analyzed on four medical datasets: Dermatology, Cleveland Heart, Lung Cancer and Wisconsin. On these datasets, the proposed k-means based missing attribute prediction achieves higher accuracy, 94.60%, 90.45%, 87.51% and 95.70% respectively. Conclusion: The higher classification accuracy shows the superior performance of k-means based missing attribute value prediction.
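The abstract does not spell out how k-means is used to fill in the missing values, so the following is only a minimal sketch of one common scheme, not the authors' procedure. It assumes that the complete instances are clustered first, that each incomplete instance is assigned to the nearest centroid using only its observed attributes, and that the centroid's value is copied into each missing slot. The function names (`kmeans`, `impute_kmeans`), the use of `None` to mark a missing value, and the deterministic farthest-point initialization are all assumptions made for illustration.

```python
def _dist2(row, centroid):
    # Squared Euclidean distance, skipping attributes missing (None) in `row`.
    return sum((x - c) ** 2 for x, c in zip(row, centroid) if x is not None)


def kmeans(rows, k, iters=100):
    # Assumed initialization: start from the first row, then repeatedly add
    # the row farthest from the centroids chosen so far (deterministic).
    centroids = [list(rows[0])]
    while len(centroids) < k:
        farthest = max(rows, key=lambda r: min(_dist2(r, c) for c in centroids))
        centroids.append(list(farthest))

    for _ in range(iters):
        # Assignment step: each row joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for r in rows:
            j = min(range(k), key=lambda i: _dist2(r, centroids[i]))
            clusters[j].append(r)
        # Update step: a centroid becomes the mean of its members
        # (an empty cluster keeps its previous centroid).
        new = [
            [sum(col) / len(members) for col in zip(*members)]
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new == centroids:
            break
        centroids = new
    return centroids


def impute_kmeans(data, k=2):
    # Cluster only the complete instances.
    complete = [r for r in data if None not in r]
    centroids = kmeans(complete, k)
    filled = []
    for r in data:
        if None in r:
            # Nearest centroid, judged on the observed attributes only.
            c = min(centroids, key=lambda cen: _dist2(r, cen))
            r = [c[i] if x is None else x for i, x in enumerate(r)]
        filled.append(list(r))
    return filled
```

Calling `impute_kmeans(data, k)` on a list of equal-length rows returns a new list in which complete rows are unchanged and each `None` is replaced by the matching coordinate of the nearest centroid. Unlike a global mean or mode, the fill value comes from instances that resemble the incomplete one, which is the motivation the abstract gives for a clustering-based approach.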
Similar resources
A Fuzzy C-means Algorithm for Clustering Fuzzy Data and Its Application in Clustering Incomplete Data
The fuzzy c-means clustering algorithm is a useful tool for clustering, but it is suitable only for crisp, complete data. In this article, an enhancement of the algorithm is proposed which is suitable for clustering trapezoidal fuzzy data. A linear ranking function is used to define a distance for trapezoidal fuzzy data. Then, as an application, a method based on the proposed algorithm is pres...
An Improved K-Means with Artificial Bee Colony Algorithm for Clustering Crimes
Crime detection is one of the major issues in the field of criminology. In fact, criminology includes knowing the details of a crime and its intangible relations with the offender. In spite of the enormous amount of data on offenses and offenders, and the complex and intangible semantic relationships between this information, criminology has become one of the most important areas in the field o...
Fuzzy K-means clustering with missing values
Fuzzy K-means clustering algorithm is a popular approach for exploring the structure of a set of patterns, especially when the clusters are overlapping or fuzzy. However, the fuzzy K-means clustering algorithm cannot be applied when the real-life data contain missing values. In many cases, the number of patterns with missing values is so large that if these patterns are removed, then sufficient...
Evaluating a Nearest-Neighbor Method to Substitute Continuous Missing Values
This work proposes and evaluates a Nearest-Neighbor Method to substitute missing values in datasets formed by continuous attributes. In the substitution process, each instance containing missing values is compared with complete instances, and the closest instance is used to assign the attribute missing value. We evaluate this method in simulations performed in four datasets that are usually emp...
An Effective Attribute Clustering Approach for Feature Selection and Replacement
Feature selection is an important pre-processing step in mining and learning. A good set of features can not only improve the accuracy of classification, but also reduce the time to derive rules. It is executed especially when the number of attributes in a given training data set is very large. In this paper, an attribute clustering method based on genetic algorithms is proposed for feature selecti...